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Abstract. We consider models of nucleotidic substitution processes 
where the rate of substitution at a given site depends on the state of 
the neighbours of the site. We first estimate the time elapsed between 
an ancestral sequence at stationarity and a present sequence. Second, 
assuming that two sequences are issued from a common ancestral se- 
quence at stationarity, we estimate the time since divergence. In the 
simplest nontrivial case of a Jukes-Cantor model with CpG influence, 
we provide and justify mathematically consistent estimators in these 
two settings. We also provide asymptotic confidence intervals, valid for 
nucleotidic sequences of finite length, and we compute explicit formulas 
for the estimators and for their confidence intervals. In the general case 
of an RN model with YpR influence, we extend these results under a 
proviso, namely that the equation defining the estimator has a unique 
solution. 



Introduction 

A crucial step in the computation of phylogenetic trees based on aligned 
DNA sequences is the estimation of the evolutionary times between these 
sequences. In most phylogenetic algorithms based on stochastic substitution 
models, one assumes that each site evolves independently from the others 
and. in general, according to a given Markovian kernel. This assumption 
is mainly due to the difficulty to work without the assumption of indepen- 
dence. To understand why note that the distribution of the nucleotide at 
site j at a given time depends a priori on the values at previous times of the 
dinucleotides at sites i — 1 and i + 1, whose joint distributions, in turn, may 
depend on the values of some trinucleotides, and so on. Hence, one is faced 
with infinite-dimensional linear systems, which are generically hard to solve. 
Besides, the magnitude of the effect of the neighbours on the substitution 
rates can be large. Since some neighbour influences are well documented 
in the literature, and caused by well known biological mechanisms, it seems 
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necessary to take into account the neighbour influences in substitution mod- 
els. To wit, a class of mathematical models with neighbour influences was 
recently introduced by biologists, see [3], and studied mathematically, see 
P. 

The goal of the present paper is to show that one can compute consistent 
estimators of the distances between DNA sequences whose evolution is ruled 
by models with influence in a specific class of models. 

We completely describe the construction in the simplest non trivial case, the 
Jukes-Cantor model with (symmetric) CpG influence and we explain in the 
appendix how to extend our construction to every model in the class. 

In section [TJ we describe the Jukes-Cantor model with CpG influence, the 
simplest one of the class of manageable models introduced in [JJ, and its main 
properties. In section [21 we summarize our main results on the estimation of 
the elapsed time between an old DNA sequence and a present one, and on 
the time since two present DNA sequences issued from the same ancestral 
sequence diverged. The appendix contains the extension of the results of 
section [2 In the other sections we prove our results. At the end of section [21 
is a plan of the rest of the paper. 

1. Models with influence 

We first describe the Jukes-Cantor model with CpG influence which the 
results of this paper apply. Then, we mention its main mathematical prop- 
erties, already established in [T|, and we introduce some notations. 

Recall that DNA sequences are encoded by the alphabet srf = {A, T, C, G}, 
where the letters stand for Adenine, Thymine, Cytosine and Guanine re- 
spectively. Thus, bi-infinite DNA sequences are encoded as elements of srf^ 
where Z is the set of integers. 

1.1. Jukes-Cantor model with CpG influence (JC+CpG). In most 
models of DNA evolution, one assumes that each site evolves independently 
from the others and follows a given Markovian kernel, see [9], |10] , [3j and [6j 
for instance. Even in codon evolution models, see [8], one often assumes that 
different codons evolve independently, with however some exceptions such 
as [TJ. On the other hand, it is a well known experimental fact, see [2] by 
example, that the nature of the close neighbours of a site can modify, notably 
in some cases, the substitution rates observed at this site. To take account 
of these observations, we consider models, in continuous time, where the 
sequence evolves under the combined effect of two superimposed mechanisms. 

The first mechanism is an independent evolution of the sites as in the usual 
models. Hence it is characterized by a 4 x 4 matrix of substitution rates, 
each rate being the mean number of substitutions per unit of time. The 
simplest case is the Jukes-Cantor model, where each substitution happens 
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at the same rate. Hence, the rate of the substitutions of x by y is set to 1, 
for every nucleotides x and y in s&. 

A second mechanism is superimposed, which describes the substitutions due 
to the influence of the neighborhood: the most noticeable case is based 
on experimentally observed CpG-methylation-deamination processes, whose 
biochemical causes are well known. Hence we assume that the substitution 
rates of cytosine by thymine and of guanine by adenine in CpG dinucleotides 
are both increased by an additional nonnegative rate r. 

This means for example that any C site whose right neighbour is not occupied 
by a G, changes at global rate 3, hence after an exponentially distributed 
random time with mean 1/3, and when it does, it becomes an A, a G or 
a T with probability 1/3 each. On the contrary, any C site whose right 
neighbour is occupied by a G, changes at global rate s = 3 + r, hence after 
an exponentially distributed random time with mean 1/s, and when it does, 
it becomes an^4, a G or a T with unequal probabilities 1/s, 1/s, and (l+r)/s 
respectively. 

The case r = corresponds to the usual Jukes-Cantor model. As soon as 
r^O, the evolution of a site is not independent of the rest of the sequence. 
Hence the evolution of the complete sequence is Markovian (on a huge state 
space), but not the evolution of a given site, nor of any given finite set of 
sites. 

Recall from [TJ that the relevant class of models, called RN+YpR in this 
paper, is in fact larger than just described. 

As already mentioned, the results of this paper about Jukes-Cantor models 
with CpG influence (hereafter denoted JC+CpG) are adapted to every RN 
model with YpR influence (hereafter denoted RN+YpR) in the appendix. 

1.2. Main properties. We work on the space stf' L with the topology prod- 
uct and the cylindric cr-algebra defined as the smallest er-algebra such that 
every projection on is measurable. 

We now recall some results of [lj, valid for every RN+YpR model. First, for 
every probability measure v on there exists a unique Markov process 
(X(t))t^o on with initial distribution u, associated to the transition 

rates above. Thus, for every time t, X(t) describes the whole sequence and, 
for every i in Z, the ith. coordinate Xi(t) of X(t) is the random value of 
the nucleotide at site i and time t. Under a non-degenaracy condition on 
the rates of the model, the process (X(i))t^o is ergodic, its unique stationary 
distribution tt on is invariant and ergodic with respect to the translations 
of Z, and n puts a positive mass on every finite word w = (iOi)o<i^€ written 
in the alphabet srf ' . The notation ir(w) is abusive because tt is a measure on 
£f z but it is a shorthand for 7t(Hq 1({w})), where Hq^ is such that for every 
x £ ,<2f z , Ko,e{x) = (a?i)o<i$e- 
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Furthermore, for every position i in Z, Pj,(.X~j : j + ^) = u;) converges to n(w) 
when i — )■ +00, where F u stands for the probability under the initial measure 
v. Here and later on, for every indices i and j in Z with i ^ j and every sym- 
bol S, the shorthand Si-j denotes (Sk)i^k^j- Finally, if £ in is distributed 
according to 7T, the empirical frequencies of any word w in £, observed along 
any increasing sequence of intervals of Z, almost surely converge to n(w). 

All of the above properties stem from the following representation of the 
distribution ir. There exists an i.i.d. sequence (£j)igz of Poisson processes, 
and a measurable map with values in g/, such that if one sets 

for every site i in Z, then the distribution of (Sj)j g z is 7T. In particular, any 
collections (Ej)j e j and (Hj)j g j are independent as soon as the subsets I and 
J of Z are such that \i — j\ ^ 3 for every sites i in / and j in J. We call this 
property 2-dependence. 

1.3. Notations. Our estimators are based on various quantities provided 
by the alignment of the two sequences. 

/= 1 A r = 7 
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Figure 1. Alignment of an ancestral sequence and a present one 

For every i ^ and every word w of length i + 1 written in the alphabet gf , 
say that site i is occupied at time t by w if Xj : j + ^(t) = w. For every triple of 
subsets W, W and W" of words and every couple of times t and s, (W)(t) 
denotes the frequency of sites occupied by any of the words in W at time t, 
that is 

1 N 

(W)(t) = lim - V V Hi(t,w), where Hi(t,w) = l{X l:l+E (t) = w}, 
N-&00 iv L — ' L — ' 

and (W, W')(t) the frequency of sites occupied by any of the words in W at 
time and any of the words in W' at time t, that is 

1 N 

(W,W')(t) = Im^-Y, E E Hi{0,w)Hi{t,v/). 
i=0weWw'ew 

The limits above exist thanks to the ergodicity of it with respect to transla- 
tions. 
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When comparing two present sequences, we use the following notations. For 
every sets W and W' of words and every time t, [W,W'](t) denotes the 
frequency of sites occupied by a word of W in the left sequence (denoted by 
X 1 ) and by a word of W in the right sequence (denoted by X 2 ). 

We identify a word w and the set of words {w}. For every letter x in 
the alphabet g/, we use the shorthands *x = s/ x {x}, x* = {x} x s/, 
x * x = szf x {x} x £/ and x = s/ \ {x}. 

2. Summary of main results 

Our main result is theorem 12,41 below, which provides asymptotic confidence 
intervals for an estimation procedure of the time elapsed between a present 
sequence and an ancestral one and for the time since two present sequences is- 
sued from the same ancestral sequence diverged, for the Jukes-Cantor model 
with CpG influence (JC+CpG) of intensity r. These intervals are based on 
two consistent estimators of the elapsed time and two consistent estimators 
of the time of divergence. 

Our first estimator is based on the evolution of the frequency (C, C)(t) when 
the time t varies and the second one on the evolution of (A,A)(t). These 
estimators match the classic ones used for the original Jukes-Cantor model 
when r = 0. The symmetry of the roles played by A and T, or by C and G 
in the JC+CpG model immediately gives the relations (A, A)(t) = (T,T)(t) 
and (G,G)(t) = (C,C)(t). 

Our estimators for the divergence time are based on the evolution of the 
frequency [C, C](t) when the time t varies and on the evolution of [A, A](t). 
Even if the results are given in the same theorem, there is a substantial 
difference between [C, C] and [A, A]. Indeed, as we explain in sections [5] 
and|6] 

Theorem 2.1. In the JC+CpG model, for every positive t, 
[C,C](t) = (C,C)(2i), [A,A](t) + (A,A)(2t). 

In the appendix, theorem IB . 1 1 provides an asymptotic confidence interval for 
our estimation procedure of the time elapsed between a present sequence 
and an ancestral one, for RN+YpR models, under the condition that the 
estimator is well-defined in the general case. 

The keystep for the creation of phylogenetic trees built by a distance-based 
method is theorem 12.41 below. At the moment, a prior knowledge of the 
parameter r is needed to apply the method. We hope in the future to perform 
simulations and/or to find a mathematical method to estimate parameter r. 

We now introduce some notations needed to state theorem 12.41 and used in 
the rest of the paper. 
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Definition 2.2. Let (x,x) b s and [x,x] ODS denote for every x E {^4, C} the 
observed value of (x, x) and [x, x] on two aligned sequences of length N, that 
is, 

with Kf{t) = l{Xi(0) = Xi(t) = x}, 



with Kf(t) = l{X}{t) = Xf(t) = x}. 
In figure [1] for instance, N = 7 and (C, C%bs = f • 

Definition 2.3. Let T x and T x denote the estimators of the elapsed time 
and the divergence time respectively, defined for every x 6 {A, C}, as the 
solution in t of the equations 

(x,x)(t) = (x,x) ohs and [x, x](t) = [x, x] ohs . 

For x G {A,C}, let n^ hs , K* hs , i/*, and v^, denote observed quantities, 
defined as 

K obs 

= 4(C, C) ohs + r(C*, CG) ohs - (C)obs, 
= 4{A, A) ohs - r(*A, CG) ohs - (A) ohs , 

^obs = ( x i ^)obs ~~ x) obs + 2(xX, Xx) b s + 2(x * X,X * x) b S) 

and 

K obs = % K obB> fo bs = V^ hs . 

We note that K^. , K^ hs , v® b and t^l may be negative for some sequences 
of observations and some lengths N. However, from lemma I4TT1 in section 31 
K obs' ^obs' ^obs anc ^ ^obs are a l m °st surely positive when A~ is large. 
As explained in sections [5] and (HJ in the JC+CpG model, for every x £ 
{A, C}, the functions 

t t-t (x, x)(t), and t [x,x](t), 

are decreasing functions of t ^ 0, from (x)* at t = to (x)* at i = +oo, 
where (x)* denotes the frequency of x at stationarity. Thus, T x and T x are 
unique and well defined for any pair of aligned sequences such that 

(x)l < (x,x) obs < 

Thanks to the ergodicity of the model, this condition is almost surely satisfied 
when TV" is large enough because (x, x) G bs — > (x, x)(t) and [x, x] Q b s — > [x, x](t) 
almost surely when A" — > oo. 

However, even if T x and T x are unique and well defined, the formulas to 
compute them are not straightforward since they involve inverting a function. 
Thus, to solve equation (x,x)(t) = (x,x) b s , for example, one has to rely on 



1 N 

(x,x) obs = jj^Kfit), 



and 



i=l 



N 



i=l 
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numerical methods. Fortunately, explicit formulas for (x,x)(t) and [x, x](t) 
in the JC+CpG model do exist. 

We now state our main result. 

Theorem 2.4. Assume that the ancestral sequence is at stationarity. Then, 
in the JC+CpG model, for every x 6 {^4, C}, when N — > +oo, 



<bVW r *-*) and 4v«( T c-*) 

6oi/i converge in distribution to the standard normal law. An asymptotic 
confidence interval at level e for the elapsed time is 



< bs V N K* h J N 

An asymptotic confidence interval at level e for the time of divergence is 



t - z ( £ h r** t z ( e \ Mb* 
x kJ n > Tx + k*J n 

In both formulas, z{e) denotes the unique real number such that P(|Z| ^ 
z(e)) = e with Z a standard normal random variable. 

Theorem 12.41 implies that, for large N, the width of the confidence interval 
scales as iV -1 / 2 times a function of t, and that, for large t, this function scales 
as e 4 * for the time elapsed and as e 8 * for the time of divergence, according 
to formulas given in corollaries 15.21 and 16.21 Heuristically, this means that, 
to estimate the time t up to a given factor, one must observe a part of the 
sequence of length iV at least of order e 8 * for the time elapsed and at least 
of order e 16 * for the time of divergence. 

The rest of the paper is organized as follows. In section [3j we state cen- 
tral limit theorems for the time estimators for the Jukes-Cantor model with 
CpG influence and for the general model under conjecture 13.41 In section HJ 
we show that the central limit theorems established in section |3] imply the- 
orem 12.41 of section [2j In section [5j and [6j we characterize the evolutions 
of (C, C)(t) and [C, C](t), and in section |6] the evolutions of (^4, j4)(t)and 
[A, A](t). We state some monotonicity properties in these two sections. 

In appendix [A] we give a short description of the general RN model with 
YpR influence. In appendix IB"] we give an extension of theorem 12.41 to the 
general model under conjecture 13.41 and in appendix O the justification of 
this extension. In appendix [Dj we describe some simulations supporting our 
conjecture 13.41 



3. Central limit theorems for time estimators 



We give here central limit theorems for the time estimators in the gen- 
eral model. The strategy is the following. We first deal with (x, x) h> s and 



8 



MIKAEL FALCONNET 



[x,x] bs- We compute exactly the variance of these quantities thanks to the 
2-dependence. Then, we use a central limit theorem for mixing sequences. 
To state central limit therorem for the time estimators, we use the delta 
method, and to do that, we need to know that t i-> (x, x)(t) and t h-> [x, x](t) 
are diffeomorphisms. This is still a conjecture for the general model whereas 
we prove it for the JC+CpG model. 

3.1. Variance computations. We detail the properties of (C, C) bs> (A A) \j S , 
[C, C] bs and [A, A} ^ s . We assume that N ^ 2. 

Lemma 3.1. In the general RN+YpR model, for x € {C,A}, the mean 
of (x, x)obs> respectively [x,x] b s; with respect to tt is (x,x)(t), respectively 
[x,x](t). 

The variances of (x, x) G b s and [x, x] Q b s with respect to tt are both equal to 
a 2 (N,t), where 

Na 2 x (N,t) =(x,x)(t) - (x,x)(t) 2 + 2(1 - l/N)((xx,xx)(t) - (x,x)(t) 2 ) + 
+ 2(1 - 2/N)((x*x,x*x)(t) - {x,x){t) 2 ). 

Proof. The random variables (Kf(t))i e %, respectively (Kf (t))jgz, are Bernoulli 
random variables identically distributed with respect to tt, their common 
mean is (x,x)(t), respectively [x, x](t), and (x,x) b s , respectively [x,x] Q b s , is 
the empirical mean of the N values Kf(t), respectively Kf(t), for i from 1 
to N. Thus, we obtain the value of E((x, x) b s ), respectively E([x, x] bs)> as 
(x,x)(t), respectively [x, x](t). Furthermore, 

N 

N 2 al(N,t) = Y,™r(Kf(t)) + 2 £ cov(Kf (t), K*{t)). 

i=l lsii<j^N 

The variance of each Kf(t) is var(Kf(t)) = (x,x)(t) - (x,x)(t) 2 . The 3- 
dependence, valid for any RN+YpR model, implies that each covariance for 
\i — j\ ^ 3 is zero. The invariance by translation of tt, valid for any RN+YpR 
model, shows that each of the (N — 1) covariances such that i = j — 1 is 

cov(Kf(t),K$(t)) = (xx,xx)(t) - {x,x)(t) 2 . 

Finally, each of the (N — 2) covariances such that i = j — 2 is 

cov(Kf(t),K$(t)) = (x*x,x*x)(t) - {x,x)(t) 2 . 

The same arguments hold for the variance of [x, x] Q b s - This concludes the 
proof. □ 

3.2. Central limit theorems for (x, x) Q b s and [x, x] b s - To prove the con- 
vergence in distribution to the normal law, we use the following result. 
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Theorem 3.2 (Hall and Heyde [BJ). Let {Vi)i^z denote a stationary, ergodic, 
centered, square integrable sequence. Let = a(Vi ; i ^ 0) denote the a- 
algebra generated by the random variables Vi for i ^ 0. For every positive 
integer n, introduce 

1 n 

Assume that 

(i) for every positive n, the series E(VfcE(V^|J^o)) converges, 

(ii) the series |E(VfcE(l^ l | t ^o))| converges to zero when n — > +oo, 
uniformly with respect to K. 

Then ~K{U 2 ) converges to a real number a 2 ^ when n — > +oo. Furthermore, 
if a 2 > 0, then Un/Va 2 converges in distribution to the standard normal 
distribution. 

Proposition 3.3. In the general RN+YpR model, for x G {C,A}, when 
N — > +oo, ^fN((x, x) bs — (x,x)(t)) and y/N([x, x] bs — [x,x](t)) both con- 
verge in distribution to the centered normal distribution with variance o~ 2 (t), 
where 

(T 2 (t) = (x, x)(t) + 2(xx, xx)(t) + 2(x * x, x * x)(t) — 5(x, x)(t) 2 . 

Proof. For any RN+YpR model, for x 6 {C,A}, the sequence (K?(t))i£z, 
respectively (Kf)i^z, is stationary and ergodic. Let V? = Kf(t) — (x,x)(t), 
respectively Vf = Kf — [x,x\{t). This defines a sequence (V i x )i e %, respec- 
tively (y^)j g ^, such that the first hypothesis of theorem 13.21 holds. We 
now check conditions (i) et (ii). The 3-dependence, valid for any RN+YpR 
model, implies that, for every n ^ 3, E(V^|^q ) = E(V^) = 0, respectively 
E(VJf |J^q ) = E(l/^') = 0. Hence we only have to check the cases n = 1 and 
n = 2. 

For every k ^ 3, V£ , respectively V£ , is independent of J^"q , respectively 
J^o, and E(V^f|J?o ), respectively E(VJf |J^q ), is J^q -measurable, respectively 
J^q -measurable, hence 

E(^E(V^)) = E(V^)E(E(l/ r f|^)) = 0, 

and 

nV^{VnWE)) = E(V^)E(E(V-|.^)) = 0. 
This implies (i) and (ii), hence theorem 13.21 applies. 

To compute the asymptotic variance in the theorem, we note that the vari- 
ances of y/N{(x, x) bs — (x,x)(t)) and y/~N([x, x] b s — [x,x](i)) are both 
Na 2 (N,t), which converges to o~ 2 (t) when N — > +oo. □ 
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3.3. Central limit theorems for T x and T x . We describe explicitly the 
behaviour of T x — t and T x — t. To state our result, we use the central limit 
theorems given in proposition 13.31 but we now need to treat separately the 
JC+CpG model and the general RN+YpR model. 

For x G {C, A}, let /j, x , respectively Jl x , denote the inverse function of t t— > 
(x,x)(t), respectively t h-» [x, x](t). That is, 

t = n x ((x,x)(t)) = Jl x ([x,x](t)), 

and fj, x and Jl x are both defined on the interval ((x)*, {x)*]. 

From propositions 15.3} 16.31 and 16. 4} the functions t t— >■ (x,x)(t) and t i— > 
[x, x] (t) are diffeomorphisms in the JC+CpG model. In the general RN+YpR 
model, this is only a conjecture, supported by simulations described in ap- 
pendix [DJ showing that indeed, the function t 1— >■ (C, C)(t) is decreasing. 

Conjecture 3.4. In the RN+YpR model, for x G {C, ^4}, the functions 1 1- > 
(x,x)(t) and t ^ [x, x](t) are diffeomorphisms from [0, +oo) to ((x)*,(x)*]. 

Then, 

T x = fi x ((x,x) ohs ) and t = fi x ((x, x)(t)), 

and 

T x = Jl x ([x,x] ohs ) and t = Jl x ([x, x](t)). 
Besides, the derivatives of \i x and Jl x , with respect to t are 

fx' x ((x,x)(t)) = — — and ]? ([x, x)(t)) 



(X,x)'(t) ™ , ,v n [x ^y {t) 

Using the delta method, see one gets the following result. 

Proposition 3.5. In the IC+CpG model, for x G {C, A}, when N — > 
+oo, y/~N(T x — t), respectively ^/N(T X — t), converges in distribution to 
the centered normal distribution with variance a x (t)/(x, x)'(t) 2 , respectively 
al(t)/[x,x]'{t)\ 

Under conjecture \3.4\ the same results hold for the RN+YpR model. 



4. Proofs of the results of section [2] for JC + CpG models 

Proposition 13.51 yields the variation of T x and T x around t for x G {C,A}. 
A priori, to build a confidence interval for t from this proposition requires 
to know the value of (x, x)'(t), respectively [x,x]'(t), and of o~ x (t), which all 
depend on the quantity t to be estimated. 

As is customary, Slutsky's lemma (see allows to bypass this difficulty 
through the observed quantities «^ bs and ^ hs , respectively K^ bs and KJL,, 
defined in section [2j Indeed, Slutsky's lemma states that if two sequences 
of random variables (Xn)n and (Yzv)iV are such that (Xjv)jv converges in 
distribution to a random variable X and (Ijv)jv converges in probability to 
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a constant c, then the sequence (XnYn)n converges in distribution to the 
random variable cX. 

Lemma 4.1. In the JC+CpG model, for x G {C, A}, K® h — > — (x,x)'(t), 
^obs ~~ ^ ~\ x -> x Y(t) an d v ohs ~* a x(t) almost surely when N — > +00. 

Proof. The equalities 

(C, C)'(t) = -4(C, C){t) - r(C*, CG){t) + (C%, 
(A, A)'(t) = -4(A, A){t) + r(*A, CG)(t) + (A)*, 

given in sections [5] and [6j and the almost sure convergence of the observed 
quantities (C7,C7) obs , (C*,C7G) obs , (CC,CC) ohs , (C * C,C * C) ohs , (A,A) obs , 
(*A, CG) \> S , (AA, AA) b s and (A * A, A * ^4) bs to the corresponding limit- 
ing values, when N — > +00, imply the desired convergences. Likewise, the 
equalities 

[C,C]'(t) = -8[C,C](t) - 2r[C*,CG](t) + 2(C)„ 
[A, A]'{t) = -8[A, A)(t) + 2r[*A, CG)(t) + 2(A)*, 

imply the convergence of K^ hs . □ 

We apply Slutsky's lemma to the sequence (Xn) = (^/N(T X — t)), respec- 
tively (X]\f) = (\/ r N(T x — t)), which converges in distribution to the centered 
normal law with variance o~ x (t)/(x, x)'(t) 2 , respectively o~ x (t)/[x, x]'(t) 2 , from 
proposition [33J an d the sequence (Y/v) = ( K obs/ V^obs)' respectively (Y/v) = 
(^obs/ "v/^obs)' w hi cn converges in probability to — (x, x)' (t)/ a x (t), respec- 
tively — [x, x]'(t)/o~ x (t), from lemma |4~T1 This implies theorem 12.41 

5. Evolutions of (C,C)(t) and [C,C](t) in JC+CpG models 

In the JC+CpG model, dinucleotides coded as {C, C} x {G, G} have au- 
tonomous evolution with the following 4x4 rate matrix Q: 







CG 


CG 


CG 


CG 


CG 


( 


-(6 + 2r) 


3 + r 





3 + r\ 


CG 




1 


-4 


3 





CG 







1 


-2 


1 


CG 


V 


1 





3 


-4 ) 



The dynamics of the dinucleotides can be represented with the graph given 
in figure [2j 

The exponential of the corresponding matrix can be explicitly computed. 
One can also compute explicitly the stationary frequencies of dinucleotides 
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3 + r 



©. 



3 + r 



FIGURE 2. Dynamics of dinucleotides encoded as {C,C} x {G, G} 

coded as {C,C} x {G, G} using the same matrix. That is 

1 3 + r 



(CG% 
(CG)* 



16 + 5r : 
9 + 3r 



(CG% 



16 + 5r' 
3 + r 



16 + 5r 16 + 5r 

These stationary frequencies have already been derived in [1] by Berard, 
Gouere and Piau. 

We observe that (C, C)(t) can be expressed as a linear combination of terms 
of the form (XY, ZT)(t) where (X, Y) and (Z, T) belong to {C, C} x {G, G}. 

It is then clear that an explicit expression for (C, C)(t) can be obtained, and 
that an expression of (C, C)'(t) in terms of dinucleotide frequencies holds. 

Proposition 5.1. The evolution of (C,C)(t) satisfies the linear differential 
equation 

(C, C)'(t) = -4(C, C)(t) - r(C*, CG)(t) + (C)(0). 

Proposition 15.11 is valid out of equilibrium. We use it at stationarity hence, 
in particular, for the initial values 

4 + r 1 



(C)(0) = (C% 



(CG)(0) = (CG), 



16 + 5r 16 + 5r 

The equation in proposition 15.11 yields expressions of (C, C)(t). Consider the 
positive real numbers u, u+ and n_ defined as 

u = yi + 2r + r 2 , u+ = 6 + r + u, 
Corollary 5.2. /n i/te stationary regime, 



U- 



6 + r — u. 



(C, C)(t) = c e~ 4t + c + e-"+* + c_e-"-' + (C)l 



with 
co = 



3 + r 
2(16 + 5r) 



and c± 



3 + r 
4n(16 + 5r) 



16 + 3r) + (32 + 14r + 3r 2 )) . 
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As expected 



c + + c_ + c = (C% 



(C)l 



Furthermore, for every positive r 



4 < u_ < 5 < 2r + 7 < u. 



•+ 



< 2r + 8. 



Although the JC+CpG model is not reversible, the dynamics of dinucleotides 
encoded as {C, C} x {G, G} with respect to this model is reversible. This 
can easily be checked by looking at the cycles in figure El 

Reversibility means that the dynamics will look the same whether time runs 
forward or backward. As a result, given two sequences at stationarity, the 
probability of data in a state is the same whether one sequence is ancestral 
to the other or both are descendants of an ancestral sequence at stationarity. 
Roughly speaking, for every (X,Y) and (Z,T) that belong to {C,C} x 
{G, G}, going from a XY at time t to then back to a ZT at time t on 
another branch, is equivalent to going from a XY to at time to a ZT at 
time 2t. 

As a consequence, for every positive t, we have 



For every positive r, the parameters c± and cq, are positive. This proves the 
following proposition. 

Proposition 5.3. In the JC+CpG model, the functions t h-> (C,C)(t) and 
1 1 — y [C,C](t) are decreasing diffeomorphisms from [0, +oo) to ((C)1, (C%] . 

6. Evolutions of (A,A)(t) and L4,A](i) in JC+CpG models 

Like we did to study (C, C), it is possible to encode dinucleotides such that 
under the JC+CpG model, (A, A) is a linear combination of terms involved in 
an autonomous evolution. It suffices to encode the dinucleotides as {C,C} x 
{A, G, Y}, and the dynamics can be represented with the graph given in 
figure [3l 

However, we don't use this encoding to compute (A,A)(t). Indeed, the 
evolution matrix associated to this encoding is a 6 x 6 matrix whereas it is 
possible to deal with the 4x4 matrix Q, defined in section to state the 
evolution of (A, A) as explained below. 

We choose to present this encoding because it is a way to understand the 
difference between the role of C and A in the Jukes Cantor model with CpG 
effect. Indeed, the dynamics of dinucleotides encoded as {C, C} x {A, G, Y} 
is not reversible. This can be checked by looking at the cycle CA^t CY — > 
CG —> CA in figure [3l As a consequence, even if the non-reversibility of 
the dynamics does not strictly prove that the identity [A, A](t) = (A,A)(2t) 
never holds when r > 0, the non reversibility of the dynamics can explain 



[C,C](t) = (C,C)(2t). 
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Figure 3. Dynamics of dinucleotides encoded as {C, C} x {^4, G, Y} 

why such an identity is unlikely to be true, and in fact, unlike [C, C], as soon 
as r > and t > 0, 

[A,A](t)^(A,A)(2t). 

We strictly explain this fact at the end of the current section. Now, we 
describe a way to state the expression of (A, A)(t). Given that there are only 
three distinct set of two-letter configurations leading to different transition 
rates to *A, that is, *A, CG, and the complement of these two, the following 
result is easy to derive. 

Proposition 6.1. The evolution of (A, A)(t) satisfies the linear differential 
equation 

(A,A)'(t) = -4(A,A)(t) + r(*A,CG)(t) + (A)(0). 

Let U (t) denote the time dependent vector defined as 

f(*A,CG)(t)\ 

(*A,CG)(t) 

(*A,CG)(t) ' 
\(*A,CG)(t)J 

then we have, as a straightforward consequence of the encoding {C, C} x 
{G, G}, 

U'(t)= t QU(t). 

We can now compute (*A, CG)(t), infer the value (^4)* of (^4)(0) at station- 
arity and finally state the expression of (^4, A)(t). 

Corollary 6.2. In the stationary regime, 

(A, A)(t) = aoe" 4 * + a + e~ u+t + a.e""-' + (A)l 

with 

80 + 31r 
a °" 32(16 + 5r)' 
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and 

_ 512 + 384r + 106r 2 + 13r 3 =F u(256 + 18r + 13r 2 ) 
a± ~ 64u(16 + 5r) 2 ' 

For every positive r, the parameters a± and do, are positive. This proves the 
following proposition. 

Proposition 6.3. In the stationary JC+CpG model, the function 1 \— > (A, A)(t) 
is a decreasing diffeomorphism from [0, +oo) to ((A)1, (A)*] . 

We deal now with the evolution of [A, Extending the strategy used to 

prove proposition 16.31 one can also derive an explicit expression (not stated) 
for [A, A](t), which turns out to be different from (A, A){2t). Indeed, the 
computation under Maple shows that the coefficients of e~ v+t and e~ v ~ t , 
where v± = 10 + r ± u, in the expression of [A, A]{t) are nonzero. This 
fact alone proves that [A, A](t) can't be equal to (A, A)(2t). However, we 
observe on an exemple that the two quantities are very close as one can see 
on figure HI 



FIGURE 4. Representation of 1 1-> [A, A)(t)— (A, A)(2t), when 
r = 10 

We do not provide the expression of [A, A](t), however it seems that the 
following conjecture holds. 

Conjecture 6.4. In the JC+CpG model, the function t i->- [A, A](t) is a 
decreasing diffeomorphism from [0, +oo) to ((A)1, (j4)*]. 
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Appendix A. Short description of the RN model with YpR 

INFLUENCE AND NOTATIONS 

Firstly, RN stands for Rzhetsky-Nei and means that the 4x4 matrix of 
substitution rates which characterize the independent evolution of the sites 
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must satisfy 4 equalities, summarized as follows: for every pair of nucleotides 
x and y ^ x, the substitution rate from x to y may depend on x but only 
through the fact that x is a purine (^4 or G, symbol R) or a pyrimidine (C or 
T, symbol Y). For instance, the substitution rates from C to A and from T 
to A must coincide, likewise for the substitution rates from A to C and from 
G to C, from C to G and from T to G, and finally from A to T and from G 
to T. The 4 remaining rates, corresponding to purine-purine substitutions 
and to pyrimidine-pyrimidine substitutions, are free. 

Secondly, the influence mechanism is called YpR, which stands for the fact 
that one allows any specific substitution rates between any two YpR dinu- 
cleotides (CG, CA, TG and TA) which differ by one position only, for a total 
of 8 independent parameters. The Jukes-Cantor model with CpG effect is 
the simplest non trivial one: the only YpR substitutions with positive rate 
are CG — > CA and CG — > TG, and both happen at the same rate. 

Recall that Y denote the set of pyrimidines defined as Y = {T,C}, and R 
the set of purines defined as = {A, G}. 

The 4x4 matrix of substitution rates which characterize the independent 
evolution of the sites in RN model is given by 





A 


T 


C 


G 




A 


( ■ 


vt 


vc 


w G 


\ 


T 


VA 




w c 


V G 




C 


VA 


wt 




V G 




G 


\WA 


vt 


Vc 




) 



The influence mechanism called YpR adds specific rates of substitutions from 
each YpR dinucleotide as follows. 

• Every dinucleotide CG moves to CA at rate and to TG at rate 

• Every dinucleotide TA moves to CA at rate and to TG at rate 

T 
r G- 

• Every dinucleotide CA moves to CG at rate Tq and to TA at rate 

• Every dinucleotide TG moves to CG at rate r^J and to TA at rate 

T 
r A- 



Appendix B. Extension of theorem 12.41 to the RN model with 

YpR influence 

Under conjecture 13.41 it is possible to generalize theorem 12.41 by suitably 
generalizing the definitions of k and v given in section |2j Introduce the 



PHYLOGENETIC DISTANCES FOR MODELS WITH INFLUENCE 



17 



parameters 

K obl = ~ v c(C, A) ohs - w c (C, T) obs + (v A + w T + v G )(C, C) obs - v c (C, G) ohs 
„Arn, rr as jGin, r^s , A, 



r£(C*, TA) ohs - rg(C*, TG) ohs + r#(C*, CA) ohs + r<?(C*, CG) 



obs • 



RN 
u obs 



v obs- 



,-.Cj 

C 



\}bs' 



and rtf 



r, which 



When vc = wq = va = wt = vq 
is the case in the JC+CpG model, 

On the other hand, the observed quantity z/? is unchanged between JC+CpG 
models and RN+YpR models because lemma [3TT1 holds in the general case. 

Once again, Slutsky's lemma, through the observed quantities and 
is the key to state theorem IB. II below, which is a consequence of proposi- 
tion! 



Theorem B.l. Assume that the ancestral sequence is at stationarity and 

that conjecture \3J\ holds. Then, when N -)• +oo 7 k§£ J n /v™( t C ~ t) 
converges in distribution to the standard normal law. An asymptotic confi- 
dence interval at level e for t is 




,RN 



jibs_ rp _|_ Z[£) 



%bs 




where z(e) denotes the unique real number such that f(\Z\ ^ z(e)) = £ with 
Z a variable with standard normal law. 



As in the JC+CpG model, the estimator Tq is defined implicitly for RN+YpR 
models. We do not provide an explicit formula for (C, C)(i) in the general 
model, but there are numerical methods to compute a closed form of the 
theoretical solution of the differential linear system, and consequently it is 
possible to solve equation (C, C)(t) = (C, C%bs with numerical methods. 



Appendix C. Evolution of (C,C)(t) in RN+YpR models 

We base our description of the method in the general RN+YpR model on the 
encoding of dinucleotides as {R,T,C} x {Y, G, A}, which has autonomous 
evolution. 

The detailed description of the corresponding 9x9 matrix is given below as 
m(uv, xy), where uv and xy are generic elements of the alphabet. 

Let vr and vy denote the quantities defined as 

VR = VA + V G , VY=V T + V C - 
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Then, 



m(uv, xy) 


= o, 


if and v ^ y; 




m(Rx, ux) 


= v u , 


if 


x G {Y, G, A} and 


uG{C,T}; 


m(ux, Rx) 


= VR, 


if 


x £ {Y, G, A} and 


u€{C,T}; 


m(Ru, Rv) 


= W v 


if 


{u,v} = {A,G}; 




m(xY, xu) 


= v u , 


if 


x G {R, C, T} and 


u€{A,G}; 


m(xu, xY) 


= VR, 


if 


x G {R, C, T} and 


u€{A,G}; 


m(uY, vY) 


= W v 


if 


{u,v} = {T,C}; 




m(xu, xv) 


= W v 


+ r%, 


if {u,v} = {A,G} 


and x £ {T, C} 


m(ux, vx) 


= W v 


+ C 


if {u,v} = {C,T} 


and x G {A, G} 



It is then clear that quantities such as (C,C)(t) can be computed provided 
one computes the exponential of the rate-matrix, and that quantities such 
as (C, C)'(t) have computable explicit expressions in terms of frequencies 
expressed in the coded dinucleotide-alphabet {R,T,C} x {Y, G, A}. 

Appendix D. Simulations 

As a support to the conjecture that t \— > (C,C)(t) always defines a diffeo- 
morphism in the general RN+YpR model, we performed some simulations. 
We give the range of parameter values that we explored and one example of 
figure obtained for one set of parameters, here a Kimura model with CpG 
influence. The code is available on the website 

http : / /www-four ier . uj f -grenoble . f r/~mikael . f / en/ recherches . htm 



D.l. Range of parameter values explored. 
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1 
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WA 
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3 
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Vt 
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1 
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0.3 


Wt 
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0.3 


0.3 
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v c 
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Wc 
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0.3 


0.3 
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V G 
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1 
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1 
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10 


W G 
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3 


0.3 


0.3 


3 
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0.1 


r A 


10 


10 


10 


10 


0.3 


10 


10 


r% 


10 


10 


10 


10 


0.3 


10 


5 


r c 











10 


0.3 


5 


1 


r G 











10 


0.3 


5 


0.5 


r G 











10 


0.3 


3 


20 


r£ 











10 


0.3 


3 


3 


rg 











10 


0.3 


1 


0.3 


T 
r A 











10 


0.3 


1 


0.1 



D.2. One example of figure performed on Maple. Figure [5] illustrates 
a simulation performed with the parameter values 

va = vt = vc = vq = 1, wa = wt = wc = wq = 3, 

r° A =4 = 10, r c = r T G = rg = 4 = rg = r T A = 0. 
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This is a Kimura model with CpG influence. The function t i— >■ [A, A](t) is 
represented on the interval [0,2]. 

O.I 8 - 



0.1 6 



0.1 4 



0.08 



0.06 



0.04 




O 0.5 1 1 .5 2 



Figure 5. One simulation of the function t \— > \A,A]{t) on 
the interval [0, 2] 



References 

[1] J. Berard, J.-B. Gouere, and D. Piau. Solvable models of neighbor-dependent nu- 
cleotide substitution processes. Mathematical Biosciences, 211:56-88, 2008. 

[2] L. Duret and N. Galtier. The covariation between TpA deficiency, CpG deficiency, 
and G+C content of human isochores is due to a mathematical artifact. Molecular 
biology and evolution, 17:1620-1625, 2000. 

[3] J. Felsenstein. Evolutionary trees from DNA sequences : A maximum likelihood 
approach. J. Mol. Evol, 17:368-376, 1981. 

[4] N. Galtier, M. Gouy, and C. Gautier. Seaview and phylowin, two graphic tools for 
sequence alignment and molecular phylogeny. Comput. Applic. Biosci., 12:543-548, 
1996. 

[5] P. Hall and C. C. Heyde. Martingale limit theory and its applications. Academic Press, 
New York, 1980. 

[6] M. Hasegawa, H. Kishino, and T. Yano. Dating of the human-ape splitting by a 
molecular clock of mitochondrial DNA. J. Mol. Evol, 22:160-174, 1985. 

[7] J. L. Jensen and A. Pedersen. Probabilistic models of DNA sequence evolution with 
context dependent rates substitution. Adv. Appl. Prob., 32:459-517, 2000. 

[8] D. Jones, W. Taylor, and J. Thornton. The rapid generation of mutation data matrices 
from protein sequences. Comput. Appl. Biosci., 8:275-282, 1992. 

[9] T.H. Jukes and C.R. Cantor. Mammalian protein metabolism, chapter Evolution of 
Protein Molecules, pages 21-132. Academic Press, New York, 1969. 
[10] M. Kimura. A Simple Method for Estimating Evolutionary Rates of Base Substitu- 
tions Through Comparative Studies of Nucleotide Sequences. J. Mol. Evol, 10:111— 
120, 1980. 

[11] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998. 



Universite Joseph Fourier Grenoble 1, Institut Fourier UMR 5582 UJF- 
CNRS, 100 rue des Maths, BP 74, 38402 Saint Martin d'Heres, France 



